Capstone Project - Is Early Marriage Still Common in Vietnam?

By: Dexter Nguyen

Date: May 28, 2020

1. Importing Libraries

In [117]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# library for transforming JSON into a pandas DataFrame
from pandas.io.json import json_normalize

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib as mpl
import seaborn as sns

print('Libraries imported.')
Libraries imported.

2. Introduction: Business Problem

In [118]:
trenddata=pd.read_csv("Data11_Age_GetMarried_Trend.csv")
trenddata.set_index('Year', inplace=True)
trenddata.plot(kind='line', figsize=(12, 6)) # figsize takes a (width, height) tuple
plt.title('Trend of Average Age at the First Marriage')
plt.ylabel('Average Age')
plt.xlabel('Year')
plt.show()

3. Data Analysis

3.1. Loading and Cleaning Marriage data

In [119]:
data=pd.read_csv("Data_Master.csv")
data.head()
Out[119]:
Province/City Area (km2) Population (Thousand) Density (/km2) Immigration Rate Migration Rate Net Migration Rate Age_First_Marriage Divoirce_Counts Marriage_Counts_1stTime Average_Children Monthly_Income Percentage_In_LaborForce Female/Male Ratio (%) Cost_Of_Living_Index Working_Hours/Week Working_Hours/Month
0 Ha Noi 3359 7521 2239 4.7 2.6 2.1 26.2 963 40990 2.07 6054 46.7 97.3 100.00 46.7 186.8
1 Vinh Phuc 1235 1092 884 2.0 1.2 0.8 23.7 273 6646 2.48 3698 22.0 98.1 92.62 45.0 180.0
2 Bac Ninh 823 1248 1516 11.1 2.0 9.1 24.0 251 8315 2.66 5445 27.9 94.5 94.95 50.2 200.8
3 Quang Ninh 6178 1267 205 1.4 3.2 -1.8 25.7 419 7384 2.22 4777 35.1 101.9 96.12 48.0 192.0
4 Hai Duong 1668 1808 1083 3.8 1.5 2.3 25.3 415 11775 2.59 3693 17.6 96.1 92.87 44.9 179.6
In [120]:
# let's examine the types of the column labels
all(isinstance(column, str) for column in data.columns)
Out[120]:
True
In [121]:
# View the information of the dataset 
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Province/City             63 non-null     object 
 1   Area (km2)                63 non-null     int64  
 2   Population (Thousand)     63 non-null     int64  
 3   Density (/km2)            63 non-null     int64  
 4   Immigration Rate          63 non-null     float64
 5   Migration Rate            63 non-null     float64
 6   Net Migration Rate        63 non-null     float64
 7   Age_First_Marriage        63 non-null     float64
 8   Divoirce_Counts           63 non-null     int64  
 9   Marriage_Counts_1stTime   63 non-null     int64  
 10  Average_Children          63 non-null     float64
 11  Monthly_Income            63 non-null     int64  
 12  Percentage_In_LaborForce  63 non-null     float64
 13  Female/Male Ratio (%)     63 non-null     float64
 14  Cost_Of_Living_Index      63 non-null     float64
 15  Working_Hours/Week        63 non-null     float64
 16  Working_Hours/Month       63 non-null     float64
dtypes: float64(10), int64(6), object(1)
memory usage: 8.5+ KB

3.2. Exploratory Data Analysis

In [122]:
# Shape of the data frame
data.shape
Out[122]:
(63, 17)
In [123]:
# Visualize the histogram of each variable
temp1=data.copy()

temp1.hist(bins=15, color='steelblue', edgecolor='black', linewidth=1.0,
          xlabelsize=8, ylabelsize=8, grid=False)    
plt.tight_layout(rect=(0, 0, 2, 2))
In [124]:
# Review the pairwise correlations between attributes (including all data points)

temp2=data.copy()

corr_matrix=temp2.corr()
corr_matrix.style.background_gradient(cmap='coolwarm')
Out[124]:
Area (km2) Population (Thousand) Density (/km2) Immigration Rate Migration Rate Net Migration Rate Age_First_Marriage Divoirce_Counts Marriage_Counts_1stTime Average_Children Monthly_Income Percentage_In_LaborForce Female/Male Ratio (%) Cost_Of_Living_Index Working_Hours/Week Working_Hours/Month
Area (km2) 1.000000 -0.035547 -0.492332 -0.198389 -0.135971 -0.135433 -0.313109 -0.244357 0.146864 0.343318 -0.525864 -0.229689 0.442744 0.137171 -0.087717 -0.087717
Population (Thousand) -0.035547 1.000000 0.785062 0.196711 -0.045897 0.199917 0.324187 0.702930 0.934898 -0.213952 0.555002 0.500578 -0.316104 0.402348 0.233922 0.233922
Density (/km2) -0.492332 0.785062 1.000000 0.258358 -0.086557 0.272275 0.294276 0.613809 0.594899 -0.251514 0.670424 0.543004 -0.466776 0.351100 0.275868 0.275868
Immigration Rate -0.198389 0.196711 0.258358 1.000000 0.007574 0.931564 0.017003 0.127032 0.057922 -0.223057 0.551301 0.178766 -0.320178 0.202139 0.345879 0.345879
Migration Rate -0.135971 -0.045897 -0.086557 0.007574 1.000000 -0.356445 0.110212 0.128525 0.052650 -0.202464 -0.047995 -0.377656 -0.088996 -0.382891 -0.209301 -0.209301
Net Migration Rate -0.135433 0.199917 0.272275 0.931564 -0.356445 1.000000 -0.024603 0.070853 0.034309 -0.134852 0.531947 0.304787 -0.265576 0.328954 0.398574 0.398574
Age_First_Marriage -0.313109 0.324187 0.294276 0.017003 0.110212 -0.024603 1.000000 0.522999 0.333452 -0.667641 0.568061 0.243605 -0.226361 -0.005061 0.031286 0.031286
Divoirce_Counts -0.244357 0.702930 0.613809 0.127032 0.128525 0.070853 0.522999 1.000000 0.678443 -0.533626 0.592911 0.165270 -0.316143 0.069545 -0.046819 -0.046819
Marriage_Counts_1stTime 0.146864 0.934898 0.594899 0.057922 0.052650 0.034309 0.333452 0.678443 1.000000 -0.174684 0.444649 0.365674 -0.219471 0.270276 0.165869 0.165869
Average_Children 0.343318 -0.213952 -0.251514 -0.223057 -0.202464 -0.134852 -0.667641 -0.533626 -0.174684 1.000000 -0.516828 -0.012554 0.144632 -0.084612 0.121046 0.121046
Monthly_Income -0.525864 0.555002 0.670424 0.551301 -0.047995 0.531947 0.568061 0.592911 0.444649 -0.516828 1.000000 0.563955 -0.407442 0.264964 0.433386 0.433386
Percentage_In_LaborForce -0.229689 0.500578 0.543004 0.178766 -0.377656 0.304787 0.243605 0.165270 0.365674 -0.012554 0.563955 1.000000 -0.140520 0.636313 0.480425 0.480425
Female/Male Ratio (%) 0.442744 -0.316104 -0.466776 -0.320178 -0.088996 -0.265576 -0.226361 -0.316143 -0.219471 0.144632 -0.407442 -0.140520 1.000000 0.090575 -0.070494 -0.070494
Cost_Of_Living_Index 0.137171 0.402348 0.351100 0.202139 -0.382891 0.328954 -0.005061 0.069545 0.270276 -0.084612 0.264964 0.636313 0.090575 1.000000 0.257225 0.257225
Working_Hours/Week -0.087717 0.233922 0.275868 0.345879 -0.209301 0.398574 0.031286 -0.046819 0.165869 0.121046 0.433386 0.480425 -0.070494 0.257225 1.000000 1.000000
Working_Hours/Month -0.087717 0.233922 0.275868 0.345879 -0.209301 0.398574 0.031286 -0.046819 0.165869 0.121046 0.433386 0.480425 -0.070494 0.257225 1.000000 1.000000
In [125]:
# Based on the correlation analysis above, we drop 8 variables:
#   Area (km2), Population (Thousand), Immigration Rate, Migration Rate,
#   Marriage_Counts_1stTime, Average_Children, Cost_Of_Living_Index,
#   and Working_Hours/Month.
temp3 = temp2.drop(columns=['Migration Rate','Immigration Rate','Population (Thousand)','Area (km2)','Marriage_Counts_1stTime','Average_Children','Cost_Of_Living_Index','Working_Hours/Month'])
temp3.head()
Out[125]:
Province/City Density (/km2) Net Migration Rate Age_First_Marriage Divoirce_Counts Monthly_Income Percentage_In_LaborForce Female/Male Ratio (%) Working_Hours/Week
0 Ha Noi 2239 2.1 26.2 963 6054 46.7 97.3 46.7
1 Vinh Phuc 884 0.8 23.7 273 3698 22.0 98.1 45.0
2 Bac Ninh 1516 9.1 24.0 251 5445 27.9 94.5 50.2
3 Quang Ninh 205 -1.8 25.7 419 4777 35.1 101.9 48.0
4 Hai Duong 1083 2.3 25.3 415 3693 17.6 96.1 44.9
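
One reason a column such as Working_Hours/Month can be dropped is that it adds nothing beyond Working_Hours/Week: the correlation matrix above reports 1.000000 between the two. The following is a minimal sketch, not the author's procedure, of how such redundant pairs could be flagged. It uses only the five rows shown in the data.head() output; the helper name `redundant_pairs` and the 0.95 threshold are illustrative assumptions.

```python
import pandas as pd

# First five rows of three columns, copied from the data.head() output above.
sample = pd.DataFrame({
    "Working_Hours/Week":  [46.7, 45.0, 50.2, 48.0, 44.9],
    "Working_Hours/Month": [186.8, 180.0, 200.8, 192.0, 179.6],
    "Age_First_Marriage":  [26.2, 23.7, 24.0, 25.7, 25.3],
})

def redundant_pairs(df, threshold=0.95):
    """Return (col_a, col_b, r) for column pairs with |r| >= threshold.

    The threshold is an illustrative assumption, not a value from the notebook.
    """
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs

pairs = redundant_pairs(sample)
print(pairs)
```

In these rows the monthly figure is exactly four times the weekly one, which is why the pair is flagged; keeping both would count the same information twice in a distance-based clustering.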
In [126]:
# Scatter Plots (all data points)
cols = temp3.columns
# `height` replaces the deprecated `size` parameter of pairplot
pp = sns.pairplot(temp3[cols], height=1.8, aspect=1.8,
                  plot_kws=dict(edgecolor="k", linewidth=0.5),
                  diag_kind="kde", diag_kws=dict(shade=True))

fig = pp.fig 
fig.subplots_adjust(top=0.93, wspace=0.3)
t = fig.suptitle('Data Attributes Pairwise Plots', fontsize=14)
In [130]:
# Scatter plot: Net Migration Rate vs Age_First_Marriage (marker area scales with Monthly_Income)
plt.scatter(temp3['Net Migration Rate'],temp3['Age_First_Marriage'],alpha=0.5, s=temp3['Monthly_Income'])
plt.xlabel('Net Migration Rate')
plt.ylabel('Age_First_Marriage')
plt.show()
In [131]:
# Scatter plot: Density (/km2) vs Age_First_Marriage (marker area scales with Monthly_Income)
plt.scatter(temp3['Density (/km2)'],temp3['Age_First_Marriage'],alpha=0.5, s=temp3['Monthly_Income'])
plt.xlabel('Density (/km2)')
plt.ylabel('Age_First_Marriage')
plt.show()
In [132]:
# Scatter plot: Monthly_Income vs Age_First_Marriage (marker area scales with Monthly_Income)
plt.scatter(temp3['Monthly_Income'],temp3['Age_First_Marriage'],alpha=0.5, s=temp3['Monthly_Income'])
plt.xlabel('Monthly_Income')
plt.ylabel('Age_First_Marriage')
plt.show()
In [133]:
# Scatter plot: Divoirce_Counts vs Age_First_Marriage (marker area scales with Monthly_Income)
plt.scatter(temp3['Divoirce_Counts'],temp3['Age_First_Marriage'],alpha=0.5, s=temp3['Monthly_Income'])
plt.xlabel('Divoirce_Counts')
plt.ylabel('Age_First_Marriage')
plt.show()
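
In the scatter plots above, raw Monthly_Income values are passed as `s`, which matplotlib interprets as marker area in points squared, so values in the thousands give very large, overlapping markers. A minimal sketch of rescaling a column to a bounded range of marker sizes follows; the helper name `scale_marker_sizes` and the 20 to 300 range are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def scale_marker_sizes(values, smin=20.0, smax=300.0):
    """Linearly map values onto [smin, smax] for use as scatter marker areas."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    if span == 0:
        # All values equal: use the midpoint size for every marker.
        return np.full_like(v, (smin + smax) / 2.0)
    return smin + (v - v.min()) / span * (smax - smin)

# Monthly_Income for the first five provinces, from the data.head() output.
income = [6054, 3698, 5445, 4777, 3693]
marker_sizes = scale_marker_sizes(income)
print(marker_sizes.round(1))
```

The result could then be passed as `s=scale_marker_sizes(temp3['Monthly_Income'])` in any of the scatter calls.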

4. Clustering analysis

4.1. Normalizing over the standard deviation

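Normalization (here, standardization) rescales each feature to zero mean and unit standard deviation, so that distance-based algorithms such as k-means do not let features with large magnitudes (for example Monthly_Income) dominate features with small ones (for example Net Migration Rate). We use StandardScaler() to standardize our dataset.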
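
To make the transformation concrete, here is a minimal NumPy sketch of column-wise standardization. It is meant to mirror what StandardScaler computes (population standard deviation, constant columns left unscaled) but is an illustration, not the library's implementation; the helper name `standardize` is hypothetical, and the demo values are two columns from the data.head() output.

```python
import numpy as np

def standardize(X):
    """Column-wise z-score: subtract the mean, divide by the population std."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)      # ddof=0, the convention StandardScaler uses
    std[std == 0] = 1.0      # leave constant columns unscaled
    return (X - mean) / std

# Density (/km2) and Monthly_Income for the first five provinces.
X_demo = np.array([[2239, 6054],
                   [ 884, 3698],
                   [1516, 5445],
                   [ 205, 4777],
                   [1083, 3693]])
Z = standardize(X_demo)
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))
```

After the transformation both columns have mean 0 and standard deviation 1, so neither dominates a Euclidean distance.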

In [134]:
from sklearn.preprocessing import StandardScaler

temp4=temp3.copy()

# Drop the province name column and cast to float so StandardScaler receives a numeric array
X = temp4.iloc[:, 1:].values.astype(float)
X = np.nan_to_num(X)
Clus_data = StandardScaler().fit_transform(X)
Clus_data
Out[134]:
array([[ 2.76157101e+00,  4.69740632e-01,  7.43527833e-01,
         1.47388841e+00,  2.32971963e+00,  3.75163991e+00,
        -4.26820706e-01,  7.64938365e-01],
       [ 6.08621528e-01,  2.88867300e-01, -8.05488485e-01,
        -4.91868937e-01,  2.81300834e-01,  3.73600989e-01,
        -1.74038525e-01,  9.23201475e-02],
       [ 1.61280165e+00,  1.44367396e+00, -6.19606527e-01,
        -5.54545258e-01,  1.80022597e+00,  1.18050097e+00,
        -1.31155834e+00,  2.14974058e+00],
       [-4.70236549e-01, -7.28793646e-02,  4.33724569e-01,
        -7.59260775e-02,  1.21943491e+00,  2.16519248e+00,
         1.02667683e+00,  1.27929347e+00],
       [ 9.24811156e-01,  4.97567298e-01,  1.85881958e-01,
        -8.73217722e-02,  2.76953596e-01, -2.28154933e-01,
        -8.05993977e-01,  5.27543700e-02],
       [ 1.25212303e+00,  1.77560634e-01,  2.47842611e-01,
         1.20893851e+00,  1.51417768e+00,  1.61814164e+00,
         1.10341428e-01,  1.31885925e-01],
       [ 1.23464521e+00,  3.86260632e-01, -4.33724569e-01,
        -4.14947997e-01,  4.07370751e-01,  1.41104383e-01,
        -7.42798432e-01,  2.11017480e-01],
       [ 9.99489109e-01, -2.25926030e-01,  3.09803264e-01,
         3.79901713e-01,  1.50014231e-01, -1.59773578e-01,
        -1.59593829e+00,  3.69280590e-01],
       [ 6.94421729e-01, -3.78972696e-01, -2.20128458e-15,
        -8.56531169e-01,  2.03050541e-01,  1.41104383e-01,
        -2.37234070e-01,  7.64938365e-01],
       [ 9.69300149e-01, -2.81579363e-01, -6.81567180e-01,
        -6.16814590e-02,  7.42480738e-03, -4.88004080e-01,
        -8.05993977e-01, -7.78126958e-01],
       [ 3.19443075e-01,  9.40806343e-02, -1.23921305e-01,
        -8.22344085e-01,  3.49987203e-01,  1.05741454e+00,
         3.31525836e-01,  9.23201475e-01],
       [-6.25948024e-01, -2.81579363e-01, -2.35450480e+00,
        -1.01037305e+00, -1.43411949e+00, -8.02558312e-01,
         1.02667683e+00,  1.43755658e+00],
       [-6.67259232e-01, -4.50526981e-02, -1.61097697e+00,
        -1.01607090e+00, -1.32022184e+00,  1.27428112e-01,
         1.73536973e-01, -1.84640295e-01],
       [-6.89503728e-01, -7.28793646e-02, -7.43527833e-01,
        -1.16991278e+00, -1.24284099e+00, -3.37565100e-01,
        -1.75392715e+00, -1.33204784e+00],
       [-5.84636816e-01, -2.53752697e-01, -1.11529175e+00,
        -7.53969916e-01, -9.68095517e-01, -9.33459762e-03,
        -1.15356948e+00, -1.84640295e-01],
       [-6.19592454e-01, -1.72260316e-02, -1.79685893e+00,
        -1.03886229e+00, -9.13320311e-01, -3.78593913e-01,
         1.08987238e+00, -7.38561180e-01],
       [-6.08470205e-01, -3.11393649e-02, -1.73489828e+00,
        -3.43724905e-01, -9.43750981e-01,  1.00075570e-01,
         3.94721381e-01,  5.27543700e-02],
       [-2.23958195e-01,  1.91473967e-01, -6.19606527e-02,
        -4.26343692e-01,  5.56046309e-01,  8.38594201e-01,
        -4.90016251e-01, -2.63771850e-02],
       [-6.45014735e-01,  6.62539678e-02, -2.47842611e-01,
        -9.10660719e-01, -1.15415733e+00,  4.53704861e-02,
         5.84308017e-01, -3.03337628e-01],
       [-1.06380143e-01,  1.06006348e-02, -3.09803264e-01,
         4.88160814e-01,  6.56778036e-02, -3.64917642e-01,
         4.71458829e-02,  9.62767253e-01],
       [-1.65169169e-01, -2.25926030e-01, -6.19606527e-01,
         1.80477055e-01, -4.19474015e-01,  3.46248447e-01,
        -3.95222933e-01,  1.55625392e+00],
       [-7.00625976e-01,  1.07993967e-01, -2.04470154e+00,
        -9.79034888e-01, -1.64974252e+00, -5.29032893e-01,
         5.52710244e-01, -1.25291629e+00],
       [-7.16514902e-01,  1.63647300e-01, -2.72626872e+00,
        -1.21834448e+00, -1.63670080e+00, -5.97414248e-01,
         1.31105679e+00, -1.56944251e+00],
       [-6.56136983e-01, -5.87672694e-01, -2.47842611e+00,
        -9.33452109e-01, -1.64452583e+00, -6.38443061e-01,
         8.68687970e-01, -1.05508740e-01],
       [-5.03603294e-01, -3.37232696e-01, -1.11529175e+00,
        -9.16358567e-01, -9.38534295e-01, -3.37565100e-01,
         4.26319154e-01, -1.17378473e+00],
       [-2.87513900e-01, -1.42446031e-01, -5.57645875e-01,
         7.16074709e-01, -3.13401395e-01,  8.63992989e-02,
         7.10699107e-01,  7.64938365e-01],
       [-4.90892153e-01, -3.23319363e-01,  2.47842611e-01,
         7.04679014e-01, -7.23780712e-01,  1.95809466e-01,
         1.41939201e-01, -4.61600738e-01],
       [-4.57525408e-01, -3.23319363e-01,  2.47842611e-01,
        -8.73624711e-01, -4.61207505e-01,  6.06097595e-01,
        -5.53211796e-01, -5.40732293e-01],
       [-6.19592454e-01, -2.53752697e-01,  6.19606527e-01,
        -5.14660326e-01, -6.15969197e-01,  6.60802678e-01,
         4.89514699e-01, -3.82469183e-01],
       [-5.79870138e-01, -1.84186030e-01,  1.85881958e-01,
        -1.04456013e+00, -7.23780712e-01,  8.11241659e-01,
        -6.16407341e-01,  7.64938365e-01],
       [-4.19391986e-01, -5.18106028e-01,  1.30117371e+00,
        -5.26056021e-01, -2.52540056e-01,  3.59924718e-01,
         2.99928064e-01, -2.63771850e-02],
       [ 5.40299146e-01,  8.31487296e-01,  1.30117371e+00,
        -6.25768350e-01,  1.85326228e+00,  3.19091280e+00,
        -1.74038525e-01,  1.55625392e+00],
       [-5.70336783e-01,  1.06006348e-02,  4.33724569e-01,
        -3.69365218e-01, -4.08171195e-01,  4.53704861e-02,
        -9.32385067e-01, -1.05508740e-01],
       [-4.03503060e-01, -2.81579363e-01,  3.09803264e-01,
        -6.28617274e-01, -4.13387881e-01, -2.41831204e-01,
         1.18466570e+00,  3.69280590e-01],
       [-3.93969704e-01, -5.89660313e-02,  4.33724569e-01,
        -2.49710423e-01, -3.04706918e-01,  1.82133196e-01,
        -9.95580612e-01,  1.55625392e+00],
       [-5.08369971e-01, -2.95492696e-01,  3.09803264e-01,
        -5.94430189e-01, -4.67293639e-01, -6.52119332e-01,
         4.57916927e-01,  2.90149035e-01],
       [-4.14625308e-01,  5.23406346e-02,  2.16862285e+00,
        -3.23782439e-01,  7.00250421e-02, -7.77159523e-02,
        -5.53211796e-01, -7.38561180e-01],
       [-5.06781079e-01,  3.84273013e-02,  5.57645875e-01,
        -5.51696334e-01, -6.46399866e-01, -6.40396814e-02,
         9.95079061e-01, -1.05508740e+00],
       [-5.48092286e-01, -4.50526981e-02,  8.67449138e-01,
         5.05254356e-01,  6.04611173e-02, -7.34176957e-01,
         5.21112472e-01,  6.06675255e-01],
       [-7.08570439e-01, -1.00706031e-01, -2.20128458e-15,
        -1.00752412e+00, -1.18893523e+00, -1.46097307e-01,
         3.99686746e+00,  4.08846368e-01],
       [-6.46603628e-01, -1.72260316e-02, -5.57645875e-01,
        -7.34027450e-01, -6.85525013e-01, -1.21284644e+00,
         2.68330291e-01,  7.25372588e-01],
       [-5.62392320e-01, -2.12012697e-01,  1.85881958e-01,
        -2.35465804e-01, -5.45543932e-01, -7.20500686e-01,
         8.37090198e-01, -5.80298070e-01],
       [-6.38659165e-01, -2.12012697e-01, -5.57645875e-01,
        -8.90718253e-01, -3.00359680e-01, -7.75205770e-01,
         1.53224119e+00,  8.44069920e-01],
       [-5.83047924e-01,  8.01673010e-02,  6.19606527e-01,
        -2.46454510e-02,  2.30872867e-01, -3.64917642e-01,
         9.00285743e-01, -4.61600738e-01],
       [-5.70336783e-01,  8.01673010e-02,  1.23921305e-01,
        -1.67091636e-01,  1.98703302e-01, -3.37565100e-01,
         6.79101335e-01,  1.71451703e-01],
       [-3.51069604e-01,  6.62539678e-02,  6.81567180e-01,
         1.27161483e+00,  7.68191549e-01, -5.15356622e-01,
         1.15306792e+00, -3.42903405e-01],
       [ 4.79921228e-01,  6.84204725e+00, -4.95685222e-01,
         6.65201072e-02,  2.99832491e+00,  1.95809466e-01,
        -1.65913384e+00,  2.07060902e+00],
       [ 3.97979766e-02,  8.73227295e-01,  1.67293762e+00,
         1.77302540e+00,  1.67328661e+00,  1.27428112e-01,
        -1.65913384e+00,  5.27543700e-01],
       [ 9.69981103e-02,  8.01673010e-02,  1.67293762e+00,
         3.88448484e-01,  1.30985747e+00,  7.42860304e-01,
         6.47503562e-01,  1.23972770e+00],
       [ 5.83131151e+00,  1.02627396e+00,  1.67293762e+00,
         4.25728686e+00,  2.43666169e+00,  2.41136536e+00,
        -2.03830711e+00,  9.62767253e-01],
       [-2.65269403e-01, -5.04192695e-01,  3.71763916e-01,
         1.44539917e+00,  7.30805297e-01, -4.33298997e-01,
         1.55481103e-02, -1.56944251e+00],
       [ 3.19443075e-01,  6.62539678e-02,  2.47842611e-01,
         2.43682462e+00,  5.29093430e-01, -9.80349834e-01,
        -7.42798432e-01, -1.84640295e-01],
       [ 4.61535470e-02, -4.20712695e-01,  4.33724569e-01,
         9.32592910e-01,  2.91610000e-02, -1.37696169e+00,
        -5.21614024e-01, -1.52987673e+00],
       [-8.89023242e-02, -1.38073269e+00, -1.23921305e-01,
        -2.89595354e-01, -4.39471313e-01, -1.14446509e+00,
        -1.12197170e+00,  1.71451703e-01],
       [ 2.98787471e-01,  6.62539678e-02,  6.19606527e-01,
         8.47125199e-01, -2.48192818e-01, -1.73449849e-01,
        -4.26820706e-01, -1.29248207e+00],
       [-1.51323105e-03, -3.37232696e-01,  6.81567180e-01,
         1.57685665e-01,  1.08280741e-01, -1.11711254e+00,
         1.73536973e-01, -9.36390068e-01],
       [ 1.76442740e-01, -1.19985936e+00,  1.23921305e-01,
         9.63931070e-01,  1.60447603e-01, -8.16234583e-01,
        -1.74038525e-01,  4.08846368e-01],
       [-3.43125141e-01, -6.43326027e-01,  4.33724569e-01,
         4.45426958e-01,  3.51726098e-01, -5.42709164e-01,
         3.31525836e-01,  2.90149035e-01],
       [ 6.19743777e-01, -7.28793646e-02,  7.43527833e-01,
         9.92420307e-01,  8.66439139e-01,  6.60802678e-01,
         2.05134746e-01,  1.31885925e-02],
       [-3.48799757e-02, -3.37232696e-01,  7.43527833e-01,
         2.17513063e-01,  1.50883679e-01, -1.30858034e+00,
        -1.94351379e+00, -1.88596873e+00],
       [-1.65169169e-01, -1.83987269e+00,  7.43527833e-01,
         1.00707191e-01,  2.41306240e-01, -1.10343627e+00,
        -7.92452074e-02, -5.80298070e-01],
       [-2.62091618e-01, -7.54632693e-01,  4.95685222e-01,
        -1.32904551e-01, -5.87277422e-01, -1.51372440e+00,
         1.10341428e-01, -1.01552162e+00],
       [-4.20980878e-01, -7.40719360e-01,  7.43527833e-01,
         1.91262266e+00, -3.37745931e-01, -9.39321021e-01,
         4.89514699e-01, -3.15207361e+00]])

4.2. Modeling

In [139]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Elbow method: fit k-means for k = 1..10 and record the within-cluster sum of squares (WCSS)

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=1000, n_init=10, random_state=0)
    kmeans.fit(Clus_data)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
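
The quantity plotted on the y-axis, WCSS, is what scikit-learn exposes as `inertia_`: the sum of squared distances from each sample to its cluster center. A minimal sketch on toy data (the points and the helper name `wcss_for` are made up for illustration) shows why the curve drops sharply once the number of clusters matches the structure in the data.

```python
import numpy as np

def wcss_for(X, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)

# Toy data: two tight pairs of points that are far apart from each other.
pts = np.array([[0, 0], [0, 2], [10, 0], [10, 2]])
one_cluster = wcss_for(pts, [0, 0, 0, 0])
two_clusters = wcss_for(pts, [0, 0, 1, 1])
print(one_cluster, two_clusters)  # 104.0 4.0
```

Going from one cluster to two removes most of the within-cluster spread; adding further clusters to this toy set could only shave off the remaining 4.0, which is the "elbow" pattern the plot is used to detect.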
In [140]:
clusterNum = 3
# No random_state is set here, so the numbering of the clusters can change between runs
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)
k_means.fit(Clus_data)
labels = k_means.labels_
print(labels)
[2 1 2 1 1 2 1 1 1 1 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 1 1 1 1 1 1 2 1 1 1 1 1
 1 1 0 0 1 0 1 1 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1]
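
The printed labels can be tallied to see how many provinces fall into each cluster. A small sketch, with the label string copied from the output above (the variable names are illustrative):

```python
import numpy as np

# Labels copied from the printed output above (63 provinces).
label_text = (
    "2 1 2 1 1 2 1 1 1 1 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 1 1 1 1 1 1 2 1 1 1 1 1 "
    "1 1 0 0 1 0 1 1 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1"
)
labels_demo = np.array([int(tok) for tok in label_text.split()])

# bincount returns the number of occurrences of 0, 1, 2, ...
cluster_sizes = np.bincount(labels_demo)
print(dict(enumerate(cluster_sizes.tolist())))  # {0: 14, 1: 42, 2: 7}
```

The clusters are quite unbalanced: two thirds of the provinces share one label, while the smallest group has only seven members.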
In [141]:
# We assign the labels to each row in dataframe.
temp4["Cluster"] = labels
temp4.head(5)
Out[141]:
Province/City Density (/km2) Net Migration Rate Age_First_Marriage Divoirce_Counts Monthly_Income Percentage_In_LaborForce Female/Male Ratio (%) Working_Hours/Week Cluster
0 Ha Noi 2239 2.1 26.2 963 6054 46.7 97.3 46.7 2
1 Vinh Phuc 884 0.8 23.7 273 3698 22.0 98.1 45.0 1
2 Bac Ninh 1516 9.1 24.0 251 5445 27.9 94.5 50.2 2
3 Quang Ninh 205 -1.8 25.7 419 4777 35.1 101.9 48.0 1
4 Hai Duong 1083 2.3 25.3 415 3693 17.6 96.1 44.9 1

4.3. Insights

In [142]:
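# We can check the centroid values by averaging the numeric features in each cluster
# (the province name column is excluded, since it cannot be averaged)
temp4.drop(columns='Province/City').groupby('Cluster').mean()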
Out[142]:
         Density (/km2)  Net Migration Rate  Age_First_Marriage  Divoirce_Counts  Monthly_Income  Percentage_In_LaborForce  Female/Male Ratio (%)  Working_Hours/Week
Cluster
0             95.857143           -2.107143           22.800000       121.642857     2058.285714                 16.400000             100.757143           44.157143
1            448.404762           -2.995238           25.559524       490.142857     3413.214286                 18.052381              98.488095           44.478571
2           1626.428571           10.700000           26.042857       826.714286     5774.285714                 32.300000              95.414286           47.714286
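
These values are plain per-cluster averages in the original units; they are not the same numbers as `k_means.cluster_centers_`, which live in standardized units. As a quick cross-check, the cluster 2 averages can be recomputed by hand from its seven member provinces. The values below are copied from the cluster 2 listing (Out[150]) in this notebook; the variable names are illustrative.

```python
import pandas as pd

# Cluster 2 rows for two columns, copied from the Out[150] listing.
cluster2 = pd.DataFrame(
    {
        "Age_First_Marriage": [26.2, 24.0, 25.4, 27.1, 24.2, 27.7, 27.7],
        "Monthly_Income": [6054, 5445, 5116, 5506, 6823, 5299, 6177],
    },
    index=["Ha Noi", "Bac Ninh", "Hai Phong", "Da Nang",
           "Binh Duong", "Dong Nai", "Ho Chi Minh City"],
)

cluster2_means = cluster2.mean()
print(cluster2_means.round(6))
```

The recomputed means agree with the cluster 2 row of the table: about 26.04 years for age at first marriage and about 5774 for monthly income.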
In [143]:
# Now, let's look at how the provinces are distributed by age at first marriage and monthly income, colored by cluster

plt.scatter(X[:, 2], X[:, 4], c=labels.astype(float), alpha=0.5)
plt.xlabel('Age_First_Marriage', fontsize=18)
plt.ylabel('Monthly_Income', fontsize=16)

plt.show()
In [144]:
# Now, let's look at how the provinces are distributed by age at first marriage and density (/km2), colored by cluster

plt.scatter(X[:, 2], X[:, 0], c=labels.astype(float), alpha=0.5)
plt.xlabel('Age_First_Marriage', fontsize=18)
plt.ylabel('Density (/km2)', fontsize=16)

plt.show()
In [145]:
# Visualize the clusters on map
temp5=temp4.copy()

temp5['Cluster'] = temp5['Cluster'].astype('category')
temp5 = temp5.rename(columns={"Province/City": "province"})
temp5.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   province                  63 non-null     object  
 1   Density (/km2)            63 non-null     int64   
 2   Net Migration Rate        63 non-null     float64 
 3   Age_First_Marriage        63 non-null     float64 
 4   Divoirce_Counts           63 non-null     int64   
 5   Monthly_Income            63 non-null     int64   
 6   Percentage_In_LaborForce  63 non-null     float64 
 7   Female/Male Ratio (%)     63 non-null     float64 
 8   Working_Hours/Week        63 non-null     float64 
 9   Cluster                   63 non-null     category
dtypes: category(1), float64(5), int64(3), object(1)
memory usage: 4.7+ KB
In [146]:
# Map province names to the spellings used in the GeoJSON file
name_fixes = {
    'Ho Chi Minh City': 'TP. Ho Chi Minh',
    'Thua Thien Hue': 'Thua Thien - Hue',
    'Ba Ria Vung Tau': 'Ba Ria - Vung Tau',
    'Lang son': 'Lang Son',
    'Dac Lak': 'Dak Lak',
}
temp5['json_name'] = temp5['province'].replace(name_fixes)
temp5.head(5)
Out[146]:
province Density (/km2) Net Migration Rate Age_First_Marriage Divoirce_Counts Monthly_Income Percentage_In_LaborForce Female/Male Ratio (%) Working_Hours/Week Cluster json_name
0 Ha Noi 2239 2.1 26.2 963 6054 46.7 97.3 46.7 2 Ha Noi
1 Vinh Phuc 884 0.8 23.7 273 3698 22.0 98.1 45.0 1 Vinh Phuc
2 Bac Ninh 1516 9.1 24.0 251 5445 27.9 94.5 50.2 2 Bac Ninh
3 Quang Ninh 205 -1.8 25.7 419 4777 35.1 101.9 48.0 1 Quang Ninh
4 Hai Duong 1083 2.3 25.3 415 3693 17.6 96.1 44.9 1 Hai Duong
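
Name mismatches like these are easy to miss, because a province whose name does not match any GeoJSON feature simply receives no cluster color on the map. A minimal sketch of a check that lists unmatched names before plotting is given below. The helper `unmatched_names` is hypothetical, the dataset names are a small subset taken from this notebook, and the GeoJSON spellings are the rename targets used in the cell above (that 'Ha Noi' matches as-is is an assumption based on it not being renamed).

```python
def unmatched_names(data_names, geo_names):
    """Return data names that have no exact match among the GeoJSON names."""
    geo = set(geo_names)
    return sorted(n for n in data_names if n not in geo)

# Province names as spelled in the dataset (a subset).
data_names = ["Ha Noi", "Ho Chi Minh City", "Thua Thien Hue", "Lang son", "Dac Lak"]
# Spellings assumed to appear in the GeoJSON 'Name' property.
geo_names = ["Ha Noi", "TP. Ho Chi Minh", "Thua Thien - Hue", "Lang Son", "Dak Lak"]

missing = unmatched_names(data_names, geo_names)
print(missing)
```

With the real file, `geo_names` could be collected from the `Name` property of each feature in the loaded GeoJSON, and an empty result would confirm that every province will be colored.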
In [147]:
import folium
import json
import requests

# Base map centred on Vietnam
vn=folium.Map(
    location=[13.2904027, 108.4265113],
    zoom_start=5)

# Province boundaries (GeoJSON)
url = 'https://data.opendevelopmentmekong.net/dataset/999c96d8-fae0-4b82-9a2b-e481f6f50e12/resource/2818c2c5-e9c3-440b-a9b8-3029d7298065/download/diaphantinhenglish.geojson'
vn_geo=json.loads(requests.get(url).text)

# Map.choropleth() is deprecated in favor of folium.Choropleth; `bins` replaces `threshold_scale`
folium.Choropleth(
    geo_data=vn_geo,
    data=temp5,
    columns=['json_name','Cluster'],
    key_on='feature.properties.Name',
    fill_color='OrRd', 
    fill_opacity=0.75, 
    line_opacity=0.2,
    legend_name='Cluster',
    bins=[0, 1, 2, 3]).add_to(vn)
vn
Out[147]:
[Interactive folium choropleth map: provinces of Vietnam shaded by cluster]
In [148]:
# Examine the first cluster
temp5[temp5['Cluster'] == 0]
Out[148]:
province Density (/km2) Net Migration Rate Age_First_Marriage Divoirce_Counts Monthly_Income Percentage_In_LaborForce Female/Male Ratio (%) Working_Hours/Week Cluster json_name
11 Ha Giang 107 -3.3 21.2 91 1725 13.4 101.9 48.4 0 Ha Giang
12 Cao Bang 81 -1.6 22.4 89 1856 20.2 99.2 44.3 0 Cao Bang
13 Bac Kan 67 -1.8 23.8 35 1945 16.8 93.1 41.4 0 Bac Kan
14 Tuyen Quang 133 -3.1 23.2 181 2261 19.2 95.0 44.3 0 Tuyen Quang
15 Lao Cai 111 -1.4 22.1 81 2324 16.5 102.1 42.9 0 Lao Cai
16 Yen Bai 118 -1.5 22.2 325 2289 20.0 99.9 44.9 0 Yen Bai
18 Lang son 95 -0.8 24.6 126 2047 19.6 100.5 44.0 0 Lang Son
21 Dien Bien 60 -0.5 21.7 102 1477 15.4 100.4 41.6 0 Dien Bien
22 Lai Chau 50 -0.1 20.6 18 1492 14.9 102.8 40.8 0 Lai Chau
23 Son La 88 -5.5 21.0 118 1483 14.6 101.4 44.5 0 Son La
24 Hoa Binh 184 -3.7 23.2 124 2295 16.8 100.0 41.8 0 Hoa Binh
39 Kon Tum 55 -2.0 25.0 92 2007 18.2 111.3 45.8 0 Kon Tum
40 Gia Lai 94 -1.4 24.1 188 2586 10.4 99.5 46.6 0 Gia Lai
42 Dak Nong 99 -2.8 24.1 133 3029 13.6 103.5 46.9 0 Dak Nong
In [149]:
# Examine the second cluster
temp5[temp5['Cluster'] == 1]
Out[149]:
province Density (/km2) Net Migration Rate Age_First_Marriage Divoirce_Counts Monthly_Income Percentage_In_LaborForce Female/Male Ratio (%) Working_Hours/Week Cluster json_name
1 Vinh Phuc 884 0.8 23.7 273 3698 22.0 98.1 45.0 1 Vinh Phuc
3 Quang Ninh 205 -1.8 25.7 419 4777 35.1 101.9 48.0 1 Quang Ninh
4 Hai Duong 1083 2.3 25.3 415 3693 17.6 96.1 44.9 1 Hai Duong
6 Hung Yen 1278 1.5 24.3 300 3843 20.3 96.3 45.3 1 Hung Yen
7 Thai Binh 1130 -2.9 25.5 579 3547 18.1 93.6 45.7 1 Thai Binh
8 Ha Nam 938 -4.0 25.0 145 3608 20.3 97.9 46.7 1 Ha Nam
9 Nam Dinh 1111 -3.3 23.9 424 3383 15.7 96.1 42.8 1 Nam Dinh
10 Ninh Binh 702 -0.6 24.8 157 3777 27.0 99.7 47.1 1 Ninh Binh
17 Thai Nguyen 360 0.1 24.9 296 4014 25.4 97.1 44.7 1 Thai Nguyen
19 Bac Giang 434 -1.2 24.5 617 3450 16.6 98.8 47.2 1 Bac Giang
20 Phu Tho 397 -2.9 24.0 509 2892 21.8 97.4 48.7 1 Phu Tho
25 Thanh Hoa 320 -2.3 24.1 697 3014 19.9 100.9 46.7 1 Thanh Hoa
26 Nghe An 192 -3.6 25.4 693 2542 20.7 99.1 43.6 1 Nghe An
27 Ha Tinh 213 -3.6 25.4 139 2844 23.7 96.9 43.4 1 Ha Tinh
28 Quang Binh 111 -3.1 26.0 265 2666 24.1 100.2 43.8 1 Quang Binh
29 Quang Tri 136 -2.6 25.3 79 2542 25.2 96.7 46.7 1 Quang Tri
30 Thua Thien Hue 237 -5.0 27.1 261 3084 21.9 99.6 44.7 1 Thua Thien - Hue
32 Quang Nam 142 -1.2 25.7 316 2905 19.6 95.7 44.5 1 Quang Nam
33 Quang Ngai 247 -3.3 25.5 225 2899 17.5 102.4 45.7 1 Quang Ngai
34 Binh Dinh 253 -1.7 25.7 358 3024 20.6 95.5 48.7 1 Binh Dinh
35 Phu Yen 181 -3.4 25.5 237 2837 14.5 100.1 45.5 1 Phu Yen
36 Khanh Hoa 240 -0.9 28.5 332 3455 18.7 96.9 42.9 1 Khanh Hoa
37 Ninh Thuan 182 -1.0 25.9 252 2631 18.8 101.8 42.1 1 Ninh Thuan
38 Binh Thuan 156 -1.6 26.4 623 3444 13.9 100.3 46.3 1 Binh Thuan
41 Dac Lak 147 -2.8 25.3 363 2747 14.0 101.3 43.3 1 Dak Lak
43 Lam Dong 134 -0.7 26.0 437 3640 16.6 101.5 43.6 1 Lam Dong
44 Binh Phuoc 142 -0.7 25.2 387 3603 16.8 100.8 45.2 1 Binh Phuoc
45 Tay Ninh 280 -0.8 26.1 892 4258 15.5 102.3 43.9 1 Tay Ninh
48 Ba Ria Vung Tau 562 -0.7 27.7 582 4881 24.7 100.7 47.9 1 Ba Ria - Vung Tau
50 Long An 334 -4.9 25.6 953 4215 16.1 98.7 40.8 1 Long An
51 Tien Giang 702 -0.8 25.4 1301 3983 12.1 96.3 44.3 1 Tien Giang
52 Ben Tre 530 -4.3 25.7 773 3408 9.2 97.0 40.9 1 Ben Tre
53 Tra Vinh 445 -11.2 24.8 344 2869 10.9 95.1 45.2 1 Tra Vinh
54 Vinh Long 689 -0.8 26.0 743 3089 18.0 97.3 41.5 1 Vinh Long
55 Dong Thap 500 -3.7 26.1 501 3499 11.1 99.2 42.4 1 Dong Thap
56 An Giang 612 -9.9 25.2 784 3559 13.3 98.1 45.8 1 An Giang
57 Kien Giang 285 -5.9 25.7 602 3779 15.3 99.7 45.5 1 Kien Giang
58 Can Tho 891 -1.8 26.2 794 4371 24.1 99.3 44.8 1 Can Tho
59 Hau Giang 479 -3.7 26.2 522 3548 9.7 92.5 40.0 1 Hau Giang
60 Soc Trang 397 -14.5 26.2 481 3652 11.2 98.4 43.3 1 Soc Trang
61 Bac Lieu 336 -6.7 25.8 399 2699 8.2 99.0 42.2 1 Bac Lieu
62 Ca Mau 236 -6.6 26.2 1117 2986 12.4 100.2 36.8 1 Ca Mau
In [150]:
# Examine the third cluster
temp5[temp5['Cluster'] == 2]
Out[150]:
province Density (/km2) Net Migration Rate Age_First_Marriage Divoirce_Counts Monthly_Income Percentage_In_LaborForce Female/Male Ratio (%) Working_Hours/Week Cluster json_name
0 Ha Noi 2239 2.1 26.2 963 6054 46.7 97.3 46.7 2 Ha Noi
2 Bac Ninh 1516 9.1 24.0 251 5445 27.9 94.5 50.2 2 Bac Ninh
5 Hai Phong 1289 0.0 25.4 870 5116 31.1 99.0 45.1 2 Hai Phong
31 Da Nang 841 4.7 27.1 226 5506 42.6 98.1 48.7 2 Da Nang
46 Binh Duong 803 47.9 24.2 469 6823 20.7 93.4 50.0 2 Binh Duong
47 Dong Nai 526 5.0 27.7 1068 5299 20.2 93.4 46.1 2 Dong Nai
49 Ho Chi Minh City 4171 6.1 27.7 1940 6177 36.9 92.2 47.2 2 TP. Ho Chi Minh